This analysis will explore a dataset on wine quality and physicochemical properties. The objective is to explore which chemical properties influence the quality of red wines.
The dataset contains red variants of the Portuguese “Vinho Verde” wine. Only physicochemical (inputs) and sensory (the output) variables are available.
fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
volatile acidity: the amount of acetic acid in wine, which at high levels can lead to an unpleasant vinegar taste
citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
chlorides: the amount of salt in the wine
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density: the density of wine is close to that of water depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
alcohol: the percent alcohol content of the wine
quality (score between 0 and 10)
This description and more background information can be found here.
This analysis is based on: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. The dataset is available here
Let’s get a first glimpse of the available variables and their distributions by printing the structure of the data and the five-number summary extended by the mean.
str(df)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
summary(df)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
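Each column of the summary above is a five-number summary plus the mean. As a minimal sketch of how one such column could be reproduced by hand, the snippet below applies the same statistics to the ten fixed acidity values shown in the str(df) output. The helper name five_num_mean is hypothetical, and because it only sees ten rows its results differ from the full-data summary.

```r
# First ten fixed.acidity values, copied from the str(df) output above.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# Hypothetical helper: quartile-based five-number summary extended by the mean,
# in the same order that summary() prints for a numeric column.
five_num_mean <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
  c(Min = q[1], Q1 = q[2], Median = q[3], Mean = mean(x), Q3 = q[4], Max = q[5])
}

result <- five_num_mean(fixed_acidity)
print(result)
```

A mean that sits above the median, as it does here and in most columns of the full summary, is a first hint of a right-skewed distribution.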
Definition: most acids involved with wine are fixed or nonvolatile (do not evaporate readily). The fixed acidity follows a right-skewed distribution with a median of 7.9 g tartaric acid per liter. Some outliers exceed 12 g/L, up to a maximum of 15.9 g/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Definition: the amount of acetic acid in wine, which at high levels can lead to an unpleasant vinegar taste. The volatile acidity is almost symmetrically distributed, with a median and mean of ~0.52 g acetic acid per liter. A few positive outliers above 1.0 g/L, reaching almost 1.6 g/L, give the distribution a slight right skew.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Definition: found in small quantities, citric acid can add ‘freshness’ and flavor to wines. The citric acid distribution decreases roughly linearly, with a median of ~0.26 g citric acid per liter. The most common value is zero: more than 150 wines do not contain any citric acid. This distribution should be investigated for correlations later on!
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Definition: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet. Half of the red wines in the dataset contain between 1.9 and 2.6 grams of sugar per liter, while a long tail of wines contains up to a maximum of 15.5 grams of sugar per liter.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Definition: the amount of salt in the wine. Most wines contain between 0.07 and 0.09 g of salt per liter. Again there is a long tail, with wines containing up to 0.6 g of salt per liter. While the majority lies close to the median of 0.08 g/L, some wines contain more than 0.1 g/L; they can clearly be seen in the distribution histogram and boxplot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
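One common way to make "a lot more chlorides than the majority" concrete is Tukey's 1.5 × IQR rule, which is also what a standard boxplot uses to place its whiskers. The sketch below applies it to the quartiles printed above. The function name iqr_fences is hypothetical, and whether the original plots relied on exactly this rule is an assumption.

```r
# Quartiles of chlorides, copied from the summary output above.
q1 <- 0.070
q3 <- 0.090

# Hypothetical helper: Tukey fences at k * IQR beyond the quartiles.
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

fences <- iqr_fences(q1, q3)
print(fences)  # lower = 0.04, upper = 0.12

# With the full data frame, values outside the fences could be listed with
# df$chlorides[df$chlorides < fences["lower"] | df$chlorides > fences["upper"]]
```

Under this rule the upper fence lies at 0.12 g/L, slightly above the 0.1 g/L mentioned in the text, so the visual impression and the formal rule point to roughly the same group of wines.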
Definition: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. In the dataset, free sulfur dioxide follows a right-skewed distribution, with a mean of about 16 mg per liter and the middle 50% of the wines containing 7-21 mg per liter.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Definition: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
The red wines contain at least 6 mg per liter, and most wines (75%) contain no more than 62 mg/L. The distribution is right-skewed, with some very extreme outliers of more than 250 mg/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Definition: the density of wine is close to that of water depending on the percent alcohol and sugar content.
The density is approximately normally distributed, with a median and mean of ~0.997 g/cm³.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Definition: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
The pH value is normally distributed, with a few positive and negative outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Definition: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
Sulphates follow a right-skewed distribution. Most wines contain between 0.3 and 0.7 g/L, with some outliers ranging up to 2 g/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Definition: the percent alcohol content of the wine.
Most wines contain around 10% alcohol (median 10.2%). The distribution is right-skewed, with more wines containing more than 10% alcohol than less.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Definition: quality is an ordinal metric, representing the median of the scores given by experts, each ranging between 0 and 10.
The experts mostly gave a rating of 5 or 6. Even though there are more wines with a rating of 6 or higher than with a rating of 5 or lower, the distribution looks roughly normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
As seen in the frequency diagram above, only a small number of wines received a rating below 5 or above 6, so I’ll bin the scores into the categories bad, OK, good and very good to get rid of the very few 3 and 8 ratings.
## bad OK good very good
## 63 681 638 217
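The binning code itself is not shown here. A minimal sketch of one way to do it with cut() follows. The break points (0-4 bad, 5 OK, 6 good, 7-10 very good) are an assumption inferred from the counts above and the remark that most wines were rated 5 or 6, and the ten quality scores are simply the first rows from the str(df) output.

```r
# First ten quality scores, copied from the str(df) output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed bins: [0, 4] bad, (4, 5] OK, (5, 6] good, (6, 10] very good.
rating <- cut(
  quality,
  breaks = c(0, 4, 5, 6, 10),
  labels = c("bad", "OK", "good", "very good"),
  include.lowest = TRUE,   # keep a score of 0 in the "bad" bin
  ordered_result = TRUE    # the ratings have a natural order
)

print(table(rating))
```

Making the result an ordered factor keeps the categories in a sensible order in tables and on plot axes, instead of sorting them alphabetically.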
There are 1599 observations of 13 numeric variables. With X being the ID, there are in total 11 input variables and 1 output variable. The output variable quality is ordinal, based on the median of at least 3 evaluations made by wine experts. Each expert graded the wine quality between 0 (very bad) and 10 (excellent). As seen in the summary above, the median quality of those ratings ranges only from 3 to 8, with a mean of 5.6 and a median of 6.
The main feature of interest is the quality, which is what matters in the end when buying and drinking wine. The most interesting variables are those that are not normally distributed: citric acid, total sulfur dioxide and alcohol.
One new variable has been introduced: the rating. As the quality distribution consists mostly of 5s and 6s with only scattered values below and above, the rating acts as a summary metric. By mapping the quality to the four rating categories bad, OK, good and very good, it can be used in further plots to reduce scatter without losing too much detail.
The boxplots already hint towards some correlations:
Wines of higher quality have:
The other variables do not seem to have an impact on the quality of the wine.
Further investigation of the correlation of the variables with quality should be done using cor.test:
## fixed.acidity volatile.acidity citric.acid density
## 0.12405165 -0.39055778 0.22637251 -0.17491923
## pH log10.sulphates alcohol
## -0.05773139 0.30864193 0.47616632
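The call that produced these coefficients is not shown. As a rough sketch, they could be computed by running cor.test() for each predictor against quality and keeping the estimate. The snippet below does this on the first ten rows from the str(df) output, so its numbers will not match the full-data values above. The helper name pearson_r is hypothetical, and density is left out because str() prints only five of its values.

```r
# First ten rows of selected columns, copied from the str(df) output above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Sulphates are log10-transformed, matching the log10.sulphates label above.
predictors <- data.frame(
  wine[c("fixed.acidity", "volatile.acidity", "citric.acid", "pH")],
  log10.sulphates = log10(wine$sulphates),
  alcohol = wine$alcohol
)

# Hypothetical helper: Pearson's r between one predictor and quality.
pearson_r <- function(x, y) unname(cor.test(x, y, method = "pearson")$estimate)

correlations <- sapply(predictors, pearson_r, y = wine$quality)
print(round(correlations, 3))
```

Using cor.test() rather than cor() also gives access to a p-value and confidence interval for each coefficient, which helps when judging weak correlations such as the one for pH.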
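The following variables correlate with wine quality (Pearson’s r from the output above): alcohol (0.48), volatile acidity (-0.39), log10 sulphates (0.31), citric acid (0.23), density (-0.17) and fixed acidity (0.12).
The pH value does not correlate with wine quality (r = -0.06).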
Let’s see how these variables compare, plotted against each other and faceted by wine rating:
Most of the plots show no clear pattern, but some observations can be made and will be explained in the analysis part.
The strongest correlation is between alcohol and quality, which can be clearly seen in the boxplot: quality rises with increasing alcohol level. The following variables also correlate with wine quality (in descending order of strength): volatile acidity, sulphates, citric acid, density and fixed acidity.
There are two main observations when investigating the correlation between those variables:
Alcohol and density correlate negatively: the higher the alcohol content, the lower the density. This is easily explained by the lower density of alcohol compared to water: the more alcohol a wine contains, the lower its density.
Fixed acidity and density correlate positively: the higher the fixed acidity, the higher the density. This was surprising at first, as acidity does not correlate with the alcohol content, which itself correlates with the density. After a little research, it seems that the dissolved acids themselves increase the density, which would explain the correlation.
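The first observation can be illustrated with a rough back-of-the-envelope calculation. The sketch below assumes ideal mixing of ethanol and water by volume and uses approximate densities at 20 °C (0.789 g/cm³ for ethanol, 0.998 g/cm³ for water). Real wine deviates from this because of non-ideal mixing and dissolved sugars and acids, so the numbers only indicate the direction of the effect. The function name is hypothetical.

```r
# Approximate densities at 20 degrees C in g/cm^3 (assumed reference values).
rho_ethanol <- 0.789
rho_water   <- 0.998

# Hypothetical helper: density of an ideal ethanol-water mixture,
# where abv is the alcohol content in percent by volume.
ideal_mix_density <- function(abv) {
  v <- abv / 100
  v * rho_ethanol + (1 - v) * rho_water
}

# Alcohol range of the dataset: minimum, median and maximum from the summary.
abv <- c(8.4, 10.2, 14.9)
mix_density <- ideal_mix_density(abv)
print(round(mix_density, 4))
```

The computed densities fall as the alcohol content rises, in line with the negative correlation. They also stay below the measured range of 0.990 to 1.004 g/cm³, which is consistent with dissolved substances such as sugars and acids raising the density of the wine.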
The scatterplots examine the 6 variables we identified as correlating with the quality of wines. To reduce the clutter, they are faceted by rating. The volatile acidity appears to be rather low for a wine of good quality, no matter the amount of alcohol contained. The sulphates and alcohol plot also shows wines of good quality being concentrated in a smaller area around 11.8% alcohol and sulphate levels around 0.75. The citric acid plot shows a very interesting gap of good wines at 0.25. Some of the wines that contain less were given a good quality rating by the judges, as were most of those that contain more. All the wines that contain between 0.19 and 0.25 of citric acid are considered average or bad.
The boxplots show a very clear trend for citric acid as well as volatile acidity on the quality of red wine. Lower volatile acidity and higher citric acid go along with better wine quality. Those two forms of acid seem to cancel each other out, as they both influence the fixed acidity and the pH value of the wine, which do not show a clear trend that could be linked to wine quality. The boxplots again show the high impact alcohol has on the quality of the wines, especially when keeping in mind that it showed the highest Pearson correlation coefficient.
Exploring the rating, the impact of alcohol volume becomes even clearer. These boxplots clearly show the effect of alcohol content on the quality of a wine. Even though there are outliers in the group of wines rated ‘OK’, in general a higher amount of alcohol is an indicator for a wine of good quality.
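The comparison behind these boxplots can also be expressed numerically, for example as the median alcohol content per rating. The sketch below is only an illustration on the first ten rows from the str(df) output, far too few to show the trend reliably. The rating bins (0-4 bad, 5 OK, 6 good, 7-10 very good) are an assumption inferred from the binned counts.

```r
# First ten rows, copied from the str(df) output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed rating bins: [0, 4] bad, (4, 5] OK, (5, 6] good, (6, 10] very good.
rating <- cut(quality, breaks = c(0, 4, 5, 6, 10),
              labels = c("bad", "OK", "good", "very good"),
              include.lowest = TRUE, ordered_result = TRUE)

# Median alcohol per rating; groups without any wine (here "bad") give NA.
median_alcohol <- tapply(alcohol, rating, median)
print(median_alcohol)
```

With the full data frame, the same call on its alcohol and rating columns would yield the medians that the boxplots draw as their center lines.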
In this graph only the very good and bad wines were considered, as the same trend was less clear when considering wines of all ratings. This summarizes the strongest findings: a wine of good quality depends on a low volatile acidity and a high alcohol content.
The analysis of wine quality identified 6 different variables that correlate with red wine quality: alcohol, volatile acidity, sulphates, citric acid, density and fixed acidity. The alcohol content and the volatile acidity are the variables with the strongest impact on red wine quality. Nonetheless, it is important to keep in mind that this dataset contains only wines of a certain region, and there could be regional differences that change the impact of e.g. volatile acidity on red wines from Portugal compared to Australia. As volatile acidity is hard to measure when buying a red wine, it might be worth going for the red wine with the higher alcohol content the next time you are in doubt about which red wine to buy.